PRODIS -- a speech database and a phoneme-based language model for the study of predictability effects in Polish

Malisz, Zofia, Foremski, Jan, Kul, Małgorzata

arXiv.org Artificial Intelligence

We present a speech database and a phoneme-level language model of Polish. The database and model are designed for analyzing prosodic and discourse factors and their impact on acoustic parameters in interaction with predictability effects. The database is also the first large, publicly available Polish speech corpus of excellent acoustic quality suitable both for phonetic analysis and for training multi-speaker speech technology systems. The speech in the database is processed by a pipeline that is 90% automated. The pipeline incorporates state-of-the-art, freely available tools, enabling expansion of the database or adaptation to additional languages.


Construction and Evaluation of Mandarin Multimodal Emotional Speech Database

Ting, Zhu, Liangqi, Li, Shufei, Duan, Xueying, Zhang, Zhongzhe, Xiao, Hairong, Jia, Huizhi, Liang

arXiv.org Artificial Intelligence

A multimodal Mandarin emotional speech database including articulatory kinematics, acoustics, glottal signals, and facial micro-expressions is designed and established, and is described in detail in terms of corpus design, subject selection, recording setup, and data processing. Signals are labeled with discrete emotion labels (neutral, happy, pleasant, indifferent, angry, sad, grief) and dimensional emotion labels (pleasure, arousal, dominance). In this paper, the validity of the dimensional annotation is verified by statistical analysis of the annotation data. The annotators' SCL-90 scale data are verified and combined with the PAD annotation data in order to explore the relationship between outliers in the annotation and the psychological state of the annotators. To verify the speech quality and emotion discriminability of the database, this paper uses three baseline models (SVM, CNN, and DNN) to compute the recognition rate for the seven emotions. The results show that the average recognition rate over the seven emotions is about 82% when using acoustic data alone, about 72% when using glottal data alone, and 55.7% when using kinematic data alone. The database is therefore of high quality and can serve as an important resource for speech analysis research, especially for multimodal emotional speech analysis.
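The per-emotion and average recognition rates reported above can be derived from a confusion matrix over the seven emotion classes. The sketch below is not the paper's code; it is a minimal stdlib-only illustration of that metric, with an assumed, purely illustrative confusion matrix.

```python
# Illustrative sketch: per-class recognition rates (class-wise accuracy)
# and their macro-average, the metric reported for the SVM/CNN/DNN baselines.
EMOTIONS = ["neutral", "happy", "pleasant", "indifferent", "angry", "sad", "grief"]

def per_class_recognition_rates(confusion):
    """confusion[i][j] = number of class-i samples predicted as class j."""
    rates = {}
    for i, emotion in enumerate(EMOTIONS):
        total = sum(confusion[i])
        rates[emotion] = confusion[i][i] / total if total else 0.0
    return rates

def average_recognition_rate(confusion):
    """Macro-average of the per-class recognition rates."""
    rates = per_class_recognition_rates(confusion)
    return sum(rates.values()) / len(rates)

# Hypothetical confusion matrix for 10 test samples per emotion.
confusion = [
    [9, 1, 0, 0, 0, 0, 0],
    [1, 8, 1, 0, 0, 0, 0],
    [0, 1, 8, 1, 0, 0, 0],
    [1, 0, 1, 8, 0, 0, 0],
    [0, 0, 0, 0, 9, 1, 0],
    [0, 0, 0, 0, 1, 8, 1],
    [0, 0, 0, 0, 0, 2, 8],
]
print(round(average_recognition_rate(confusion), 3))  # 0.829
```

With real data, each row of the matrix would be accumulated from a held-out test set for one modality (acoustic, glottal, or kinematic) at a time.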


Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

Kumar, Ritesh, Singh, Siddharth, Ratan, Shyam, Raj, Mohit, Sinha, Sonal, Lahiri, Bornini, Seshadri, Vivek, Bali, Kalika, Ojha, Atul Kr.

arXiv.org Artificial Intelligence

In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and Magahi -- using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (roughly 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out during the COVID-19 pandemic, with one aim being to generate additional income for low-income speakers of these languages. We also report the results of baseline experiments with automatic speech recognition systems for these languages.


BEA-Base: A Benchmark for ASR of Spontaneous Hungarian

Mihajlik, P., Balog, A., Gráczi, T. E., Kohári, A., Tarján, B., Mády, K.

arXiv.org Artificial Intelligence

Hungarian is spoken by 15 million people, yet easily accessible Automatic Speech Recognition (ASR) benchmark datasets, especially for spontaneous speech, have been practically unavailable. In this paper, we introduce BEA-Base, a subset of the BEA spoken Hungarian database comprising mostly spontaneous speech from 140 speakers. It is built specifically to assess ASR, primarily for conversational AI applications. After defining the speech recognition subsets and tasks, several baselines, including classic hybrid HMM-DNN and end-to-end approaches augmented by cross-language transfer learning, are developed using open-source toolkits. The best results are obtained with multilingual self-supervised pretraining, achieving a 45% relative recognition error rate reduction compared to the classical approach, without an external language model or additional supervised data. The results show the feasibility of using BEA-Base for training and evaluating Hungarian speech recognition systems.
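The "45% recognition error rate reduction" above is a relative improvement, not an absolute drop in error rate. A short sketch of that arithmetic (the WER values here are assumed for illustration, not taken from the paper):

```python
def relative_error_reduction(baseline_err: float, new_err: float) -> float:
    """Relative reduction in error rate: (baseline - new) / baseline."""
    return (baseline_err - new_err) / baseline_err

# Illustrative: a 45% relative reduction means, e.g., a baseline error
# rate of 40.0% dropping to 22.0%.
print(round(relative_error_reduction(0.40, 0.22), 2))  # 0.45
```

So two systems can both show a "45% reduction" while differing widely in absolute error rate, which is why benchmark comparisons usually report both figures.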


What Is Natural Language Processing and How Does It Work? - Text2Speech Blog

#artificialintelligence

In 1950, Alan Turing published his famous paper titled "Computing Machinery and Intelligence". The paper proposed a test to determine whether a machine was artificially intelligent. Basically, Turing said that if a machine could hold a conversation with a human and trick the human into thinking the machine was itself a person, then it was artificially intelligent. This became known as the Turing Test, and passing it has been one of the most sought-after goals in computer science. Passing the Turing Test would signal the birth of artificial intelligence.